{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Variable magnitude\n",
    "\n",
    "### Does the magnitude of the variable matter?\n",
    "\n",
    "In Linear Regression models, the scale of variables used to estimate the output matters. Linear models are of the type **y = w x + b**, where the regression coefficient w represents the expected change in y for a one unit change in x (the predictor). Thus, the magnitude of **w** is partly determined by the magnitude of the units being used for **x**. If x is a distance variable, just changing the scale from kilometers to miles will cause a change in the magnitude of the coefficient.\n",
    "\n",
    "In addition, in situations where we estimate the outcome y by contemplating multiple predictors x1, x2, ...xn, predictors with greater numeric ranges dominate over those with smaller numeric ranges.\n",
    "\n",
    "Gradient descent converges faster when all the predictors (x1 to xn) are within a similar scale, therefore making feature scaling useful for Neural Networks as well as Logistic Regression.\n",
    "\n",
    "In Support Vector Machines, feature scaling can decrease the time to find the support vectors.\n",
    "\n",
    "Finally, methods using Euclidean distances or distances in general are also affected by the magnitude of the features, as Euclidean distance is sensitive to variations in the magnitude or scales of the predictors. Therefore feature scaling is required for methods that utilise distance calculations like k-nearest neighbours (KNN) and k-means clustering.\n",
    "\n",
    "For more details on the above, follow the links in the Bonus Lecture of this section.\n",
    "\n",
    "In summary:\n",
    "\n",
    "#### Maginutd matters because:\n",
    "\n",
    "- The regression coefficient is directly influenced by the scale of the variable\n",
    "- Variables with bigger magnitudes /  value range dominate over the ones with smaller magnitudes / value range\n",
    "- Gradient descent converges faster when features are on similar scales\n",
    "- Feature scaling helps decrease the time to find support vectors for SVMs\n",
    "- Euclidean distances are sensitive to feature magnitude.\n",
    "\n",
    "#### The machine learning models affected by the magnitude of the feature are:\n",
    "\n",
    "- Linear and Logistic Regression\n",
    "- Neural Networks\n",
    "- Support Vector Machines\n",
    "- KNN\n",
    "- K-means clustering\n",
    "- Linear Discriminant Analysis (LDA)\n",
    "- Principal Component Analysis (PCA)\n",
    "\n",
    "#### Machine learning models insensitive to feature magnitude are the ones based on Trees:\n",
    "\n",
    "- Classification and Regression Trees\n",
    "- Random Forests\n",
    "- Gradient Boosted Trees\n",
    "\n",
    "\n",
    "**For more information on whether and why you should scale features prior to use in machine learning models, refer to the lecture \"Bonus: Additional reading resources on variable problems\", in the previous section of this course.** \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "===================================================================================================\n",
    "\n",
    "## Real Life example: \n",
    "\n",
    "### Predicting Survival on the Titanic: understanding society behaviour and beliefs\n",
    "\n",
    "Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. Interestingly, by analysing the probability of survival based on few attributes like gender, age, and social status, we can make very accurate predictions on which passengers would survive. Some groups of people were more likely to survive than others, such as women, children, and the upper-class. Therefore, we can learn about the society priorities and privileges at the time.\n",
    "\n",
    "====================================================================================================\n",
    "\n",
    "To download the Titanic data, go ahead to this website:\n",
    "https://www.kaggle.com/c/titanic/data\n",
    "\n",
    "Click on the link 'train.csv', and then click the 'download' blue button towards the right of the screen, to download the dataset. Save it in a folder of your choice.\n",
    "\n",
    "**Note that you need to be logged in to Kaggle in order to download the datasets**.\n",
    "\n",
    "If you save it in the same directory from which you are running this notebook, and you rename the file to 'titanic.csv' then you can load it the same way I will load it below.\n",
    "\n",
    "====================================================================================================\n",
    "\n",
    "In this notebook, I will demonstrate the effect of feature magnitude on the performance of different machine learning algorithms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "% matplotlib inline\n",
    "\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.ensemble import AdaBoostClassifier\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.neural_network import MLPClassifier\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "from sklearn.metrics import roc_auc_score\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load data with numerical variables only"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Age</th>\n",
       "      <th>Fare</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>22.0</td>\n",
       "      <td>7.2500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>38.0</td>\n",
       "      <td>71.2833</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>26.0</td>\n",
       "      <td>7.9250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>35.0</td>\n",
       "      <td>53.1000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>35.0</td>\n",
       "      <td>8.0500</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Survived  Pclass   Age     Fare\n",
       "0         0       3  22.0   7.2500\n",
       "1         1       1  38.0  71.2833\n",
       "2         1       3  26.0   7.9250\n",
       "3         1       1  35.0  53.1000\n",
       "4         0       3  35.0   8.0500"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load the numerical variables of the Titanic Dataset\n",
    "data = pd.read_csv('titanic.csv', usecols = ['Pclass', 'Age', 'Fare', 'Survived'])\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Age</th>\n",
       "      <th>Fare</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>891.000000</td>\n",
       "      <td>891.000000</td>\n",
       "      <td>714.000000</td>\n",
       "      <td>891.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>0.383838</td>\n",
       "      <td>2.308642</td>\n",
       "      <td>29.699118</td>\n",
       "      <td>32.204208</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.486592</td>\n",
       "      <td>0.836071</td>\n",
       "      <td>14.526497</td>\n",
       "      <td>49.693429</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.420000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>20.125000</td>\n",
       "      <td>7.910400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>28.000000</td>\n",
       "      <td>14.454200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>31.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>80.000000</td>\n",
       "      <td>512.329200</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Survived      Pclass         Age        Fare\n",
       "count  891.000000  891.000000  714.000000  891.000000\n",
       "mean     0.383838    2.308642   29.699118   32.204208\n",
       "std      0.486592    0.836071   14.526497   49.693429\n",
       "min      0.000000    1.000000    0.420000    0.000000\n",
       "25%      0.000000    2.000000   20.125000    7.910400\n",
       "50%      0.000000    3.000000   28.000000   14.454200\n",
       "75%      1.000000    3.000000   38.000000   31.000000\n",
       "max      1.000000    3.000000   80.000000  512.329200"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's have a look at the values of those variables to get an idea of the magnitudes\n",
    "\n",
    "data.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Pclass _range:  2\n",
      "Age _range:  79.58\n",
      "Fare _range:  512.3292\n"
     ]
    }
   ],
   "source": [
    "# let's now calculate the range\n",
    "\n",
    "for col in ['Pclass', 'Age', 'Fare']:\n",
    "    print(col, '_range: ', data[col].max()-data[col].min())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The magnitude of the values of the 3 different variables and their ranges of values are quite different. Therefore, feature scaling could benefit the performance of several machine learning algorithms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((623, 3), (268, 3))"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's separate into training and testing set\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    data[['Pclass', 'Age', 'Fare']].fillna(0),\n",
    "    data.Survived,\n",
    "    test_size=0.3,\n",
    "    random_state=0)\n",
    "\n",
    "X_train.shape, X_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Feature Scaling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this demonstration, I will scale the features between 0 and 1. \n",
    "To learn more about this scaling visit the scikit-learn website: \n",
    "http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html\n",
    "\n",
    "Briefly, the transformation is given by:\n",
    "\n",
    "X_std = (X - X.min() / (X.max - X.min())\n",
    "\n",
    "And to transform the scaled feature back to its initial format:\n",
    "\n",
    "X_scaled = X_std * (max - min) + min\n",
    "\n",
    "\n",
    "**We'll see more on feature scaling in future lectures**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# scaling the features between 0 and 1. \n",
    "\n",
    "scaler = MinMaxScaler()\n",
    "X_train_scaled = scaler.fit_transform(X_train)\n",
    "X_test_scaled = scaler.transform(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean:  [ 0.64365971  0.30131421  0.06335433]\n",
      "Standard Deviation:  [ 0.41999093  0.21983527  0.09411705]\n",
      "Minimum value:  [ 0.  0.  0.]\n",
      "Maximum value:  [ 1.  1.  1.]\n"
     ]
    }
   ],
   "source": [
    "#let's have a look at the scaled training dataset\n",
    "print('Mean: ', X_train_scaled.mean(axis=0))\n",
    "print('Standard Deviation: ', X_train_scaled.std(axis=0))\n",
    "print('Minimum value: ', X_train_scaled.min(axis=0))\n",
    "print('Maximum value: ', X_train_scaled.max(axis=0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Logistic Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "Logistic Regression roc-auc: 0.7134823539619531\n",
      "Test set\n",
      "Logistic Regression roc-auc: 0.7080952380952381\n"
     ]
    }
   ],
   "source": [
    "# model build on unscaled variables\n",
    "\n",
    "logit = LogisticRegression(random_state=44, C=1000) # c big to avoid regularization\n",
    "logit.fit(X_train, y_train)\n",
    "print('Train set')\n",
    "pred = logit.predict_proba(X_train)\n",
    "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = logit.predict_proba(X_test)\n",
    "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-0.92574443, -0.01822382,  0.00233696]])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "logit.coef_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "Logistic Regression roc-auc: 0.7134931997136722\n",
      "Test set\n",
      "Logistic Regression roc-auc: 0.7080952380952381\n"
     ]
    }
   ],
   "source": [
    "# model built on scaled variables\n",
    "logit = LogisticRegression(random_state=44, C=1000) # c big to avoid regularization\n",
    "logit.fit(X_train_scaled, y_train)\n",
    "print('Train set')\n",
    "pred = logit.predict_proba(X_train_scaled)\n",
    "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = logit.predict_proba(X_test_scaled)\n",
    "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-1.85168371, -1.45774407,  1.1951952 ]])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "logit.coef_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that the performance of logistic regression did not change when using the datasets with the features scaled (compare roc-auc values for train and test set for models with and without feature scaling). \n",
    "\n",
    "However, when looking at the coefficients we do see a big difference in the values. This is because the magnitude of the variable was affecting the coefficients. After scaling, all 3 variables have the relative same effect (coefficient) for survival, whereas before scaling, we would be inclined to think that PClass was driving the Survival outcome."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Support Vector Machines"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "SVM roc-auc: 0.9016995292943752\n",
      "Test set\n",
      "SVM roc-auc: 0.6768154761904762\n"
     ]
    }
   ],
   "source": [
    "# model build on data with plenty of categories in Cabin variable\n",
    "\n",
    "SVM_model = SVC(random_state=44, probability=True)\n",
    "SVM_model.fit(X_train, y_train)\n",
    "print('Train set')\n",
    "pred = SVM_model.predict_proba(X_train)\n",
    "print('SVM roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = SVM_model.predict_proba(X_test)\n",
    "print('SVM roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "SVM roc-auc: 0.7047081408212403\n",
      "Test set\n",
      "SVM roc-auc: 0.6988690476190477\n"
     ]
    }
   ],
   "source": [
    "SVM_model = SVC(random_state=44, probability=True)\n",
    "SVM_model.fit(X_train_scaled, y_train)\n",
    "print('Train set')\n",
    "pred = SVM_model.predict_proba(X_train_scaled)\n",
    "print('SVM roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = SVM_model.predict_proba(X_test_scaled)\n",
    "print('SVM roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Feature scaling improved the performance of the support vector machine. After feature scaling the model is no longer over-fitting to the training set (compare the roc-auc of 0.901 for the model on unscaled features vs the roc-auc of 0.704). In addition, the roc-auc for the testing set increased as well (0.67 vs 0.69)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Neural Networks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "Neural Network roc-auc: 0.678190277868159\n",
      "Test set\n",
      "Neural Network roc-auc: 0.666547619047619\n"
     ]
    }
   ],
   "source": [
    "# model built on unscaled features\n",
    "\n",
    "NN_model = MLPClassifier(random_state=44, solver='sgd')\n",
    "NN_model.fit(X_train, y_train)\n",
    "print('Train set')\n",
    "pred = NN_model.predict_proba(X_train)\n",
    "print('Neural Network roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = NN_model.predict_proba(X_test)\n",
    "print('Neural Network roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "Neural Network roc-auc: 0.7161937918917161\n",
      "Test set\n",
      "Neural Network roc-auc: 0.7125\n"
     ]
    }
   ],
   "source": [
    "# model built on scaled features\n",
    "\n",
    "NN_model = MLPClassifier(random_state=44, solver='sgd')\n",
    "NN_model.fit(X_train_scaled, y_train)\n",
    "print('Train set')\n",
    "pred = NN_model.predict_proba(X_train_scaled)\n",
    "print('Neural Network roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = NN_model.predict_proba(X_test_scaled)\n",
    "print('Neural Network roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that scaling the features improved the performance of the neural network both for the training and the testing set (compare roc-auc values: training 0.67 vs 0.71; testing: 0.66 vs 0.71). The roc-auc increases in both training and testing sets when the model is trained on a dataset with scaled features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### K-Nearest Neighbours"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "KNN roc-auc: 0.8694225721784778\n",
      "Test set\n",
      "KNN roc-auc: 0.6253571428571428\n"
     ]
    }
   ],
   "source": [
    "#model built on unscaled features\n",
    "\n",
    "KNN = KNeighborsClassifier(n_neighbors=3)\n",
    "KNN.fit(X_train, y_train)\n",
    "print('Train set')\n",
    "pred = KNN.predict_proba(X_train)\n",
    "print('KNN roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = KNN.predict_proba(X_test)\n",
    "print('KNN roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "KNN roc-auc: 0.8880555736318084\n",
      "Test set\n",
      "KNN roc-auc: 0.7017559523809525\n"
     ]
    }
   ],
   "source": [
    "# model built on scaled\n",
    "\n",
    "KNN = KNeighborsClassifier(n_neighbors=3)\n",
    "KNN.fit(X_train_scaled, y_train)\n",
    "print('Train set')\n",
    "pred = KNN.predict_proba(X_train_scaled)\n",
    "print('KNN roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = KNN.predict_proba(X_test_scaled)\n",
    "print('KNN roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe for KNN as well that feature scaling improved the performance of the model. The model built on unscaled features shows a better generalisation, with a higher roc-auc for the testing set (0.70 vs 0.62 for model built on unscaled features).\n",
    "\n",
    "Both KNN methods are over-fitting to the train set. Thus, we would need to change the parameters of the model or use less features to try and decrease over-fitting, which exceeds the purpose of this demonstration."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Random Forests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "Random Forests roc-auc: 0.9914589705212468\n",
      "Test set\n",
      "Random Forests roc-auc: 0.7602678571428572\n"
     ]
    }
   ],
   "source": [
    "# model built on unscaled features\n",
    "\n",
    "rf = RandomForestClassifier(n_estimators=700, random_state=39)\n",
    "rf.fit(X_train, y_train)\n",
    "print('Train set')\n",
    "pred = rf.predict_proba(X_train)\n",
    "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = rf.predict_proba(X_test)\n",
    "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "Random Forests roc-auc: 0.9914589705212469\n",
      "Test set\n",
      "Random Forests roc-auc: 0.7602380952380952\n"
     ]
    }
   ],
   "source": [
    "# model built in scaled features\n",
    "rf = RandomForestClassifier(n_estimators=700, random_state=39)\n",
    "rf.fit(X_train_scaled, y_train)\n",
    "print('Train set')\n",
    "pred = rf.predict_proba(X_train_scaled)\n",
    "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = rf.predict_proba(X_test_scaled)\n",
    "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, Random Forests shows no change in performance regardless of whether it is trained on a dataset with scaled or unscaled features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "AdaBoost roc-auc: 0.847736491616234\n",
      "Test set\n",
      "AdaBoost roc-auc: 0.7733630952380953\n"
     ]
    }
   ],
   "source": [
    "ada = AdaBoostClassifier(n_estimators=200, random_state=44)\n",
    "ada.fit(X_train, y_train)\n",
    "print('Train set')\n",
    "pred = ada.predict_proba(X_train)\n",
    "print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = ada.predict_proba(X_test)\n",
    "print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "AdaBoost roc-auc: 0.847736491616234\n",
      "Test set\n",
      "AdaBoost roc-auc: 0.7733630952380953\n"
     ]
    }
   ],
   "source": [
    "ada = AdaBoostClassifier(n_estimators=200, random_state=44)\n",
    "ada.fit(X_train_scaled, y_train)\n",
    "print('Train set')\n",
    "pred = ada.predict_proba(X_train_scaled)\n",
    "print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = ada.predict_proba(X_test_scaled)\n",
    "print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, AdaBoost shows no change in performance regardless of whether it is trained on a dataset with scaled or unscaled features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": "block",
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
